Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Extending*the Stata capacity to show numbers

    Dear Profs and colleagues,

    I am going to sum up all values in one variable,total_sale. I used this code to reach the accumulated amount.
    Code:
    egen sum_sale= total(SV606501)
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double total_sale float sum_sale
      51892 628288651264
     538892 628288651264
      96788 628288651264
          0 628288651264
          0 628288651264
       9112 628288651264
       2380 628288651264
          0 628288651264
      31511 628288651264
     152561 628288651264
          0 628288651264
       2829 628288651264
          0 628288651264
      81294 628288651264
          0 628288651264
      89601 628288651264
     119800 628288651264
     281376 628288651264
     125084 628288651264
    The variable is firms sales so the quantities are massive. The state says that the final accumulation is -81294 628288651264 - but I am sure that it's more than this amount, I guess after some limited number the state does not add more values. Am I right? If so, what's the solution to reach the real number?
    Thank you for your cooperation.

    Cheers,
    Paris

  • #2
    Use a double variable to hold really big numbers. But if all you want is a single sum, summarize will do it:

    Code:
    . sysuse auto, clear
    (1978 automobile data)
    
    . su weight, meanonly
    
    . di r(sum)
    223440
    See the help for summarize which documents saved results. r(sum) is not shown by default but it is accessible once summarize has finished.

    Comment


    • #3
      The one thing I am sure of is that you must do this sum as -egen double sumsale = total(SV606501)-. If you do not specify -double-, then the resulting variable will not be able to represent all of the digits in the sum and will start truncating the result. Of this I am absolutely certain. Whether this, alone, will solve your problem I cannot say. This would not lead to the total looking as if it had just stopped accumulating; it would just mean that a few low order digits would be inaccurate. I should also point out that even with -double-, which is the largest storage type available in Stata, only 16 digits of precision are possible. If your exact grand total has more than 16 significant figures, then it is not possible to store it as a numeric variable in Stata without some loss of information.

      Since you say that it looks like the total just stopped accumulating altogether, there is another possibility, that may arise if the _N in your data set is very large. It is possible that at some point the accumulated total becomes so large that adding in the next value of SV606501 has no effect because it is so many orders of magnitude smaller than the accumulated total that it becomes, literally, a "rounding error" when Stata tries to add it. A way to mitigate this is to add up the numbers starting from smallest to largest. That way the total accumulates more gradually, and by the time the total is really big, so are the numbers being added to it. So:
      Code:
      sort SV606501
      gen sumsale = sum(SV606501)
      replace sumsale = sumsale[_N]
      Added: Crossed with #2.
      Last edited by Clyde Schechter; 20 Feb 2023, 15:15.

      Comment


      • #4
        Here are the limits on storage of decimal integers with full accuracy in the various numeric storage types. The fixed-point variables lose the 27 largest positive values to missing value codes; the similar loss for floating point variables occurs only for the largest exponent, so it doesn't affect the much smaller integer values.
        byte - 7 bits -127 100
        int - 15 bits -32,767 32,740
        long - 31 bits -2,147,483,647 2,147,483,620
        float - 24 bits -16,777,216 16,777,216
        double - 53 bits -9,007,199,254,740,992 9,007,199,254,740,992

        Comment


        • #5
          Thank you, prof Nick and prof Clyde,

          I used the three mentioned approaches, The results are below.

          Code:
           egen double sumsale = total( total_sale )
          
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input double(total_sale sumsale)
             3378 628288630485
            17750 628288630485
                0 628288630485
            95449 628288630485
           217467 628288630485
           110950 628288630485
           102081 628288630485
                0 628288630485
                0 628288630485
                0 628288630485
           110948 628288630485
          Code:
          su total_sale , meanonly
          
          .di r(sum)
          6.283e+11
          =(628,300,000,000)
          Code:
           sort total_sale
          
          . gen sumsale = sum(total_sale)
          
          . replace sumsale = sumsale[_N]
          
          
          input double total_sale float sumsale
                0 628288651264
              147 628288651264
            92500 628288651264
                0 628288651264
            14951 628288651264
             2030 628288651264
                0 628288651264
          1966782 628288651264
                0 628288651264
            30253 628288651264
                0 628288651264
           329115 628288651264
                0 628288651264
          1253306 628288651264
           103356 628288651264
            17730 628288651264
           134827 628288651264
                0 628288651264
                0 628288651264
            53360 628288651264
           103359 628288651264
                0 628288651264
                0 628288651264
                0 628288651264
            48985 628288651264
          Do you think which one shall I pick up?

          Comment


          • #6
            Well, it seems likely that the actual sum is a 12-digit number beginning with a 6.

            Two things we can learn.

            1) The result must be stored in a double or precision will be lost.

            2) Once that is said, then sorting - as Clyde suggested - is not necessary, because all the intermediate values will be stored to full precision,

            The following example leads to some conclusions.
            Code:
            sysuse auto, clear
            keep price
            summarize price, meanonly
            display r(sum)
            
            // we need bigger numbers to reach a 12-digit total
            replace price = price*1000000
            
            summarize price, meanonly
            display r(sum)
            display %20.0fc r(sum)
            
            generate float f_sum = sum(price)
            replace f_sum = f_sum[_N]
            generate double e_sum = sum(price)
            replace e_sum = e_sum[_N]
            egen float f_total = total(price)
            egen double d_total = total(price)
            
            format %20.0fc *_*
            list *_* in l, noobs clean
            Code:
            . sysuse auto, clear
            (1978 automobile data)
            
            . keep price
            
            . summarize price, meanonly
            
            . display r(sum)
            456229
            
            .
            . // we need bigger numbers to reach a 12-digit total
            . replace price = price*1000000
            variable price was int now double
            (74 real changes made)
            
            .
            . summarize price, meanonly
            
            . display r(sum)
            4.562e+11
            
            . display %20.0fc r(sum)
                 456,229,000,000
            
            .
            . generate float f_sum = sum(price)
            
            . replace f_sum = f_sum[_N]
            (73 real changes made)
            
            . generate double e_sum = sum(price)
            
            . replace e_sum = e_sum[_N]
            (73 real changes made)
            
            . egen float f_total = total(price)
            
            . egen double e_total = total(price)
            
            .
            . format %20.0fc *_*
            
            . list *_* in l, noobs clean
            
                          f_sum             e_sum           f_total           e_total  
                456,228,995,072   456,229,000,000   456,228,995,072   456,229,000,000  
            
            .
            The first group highlighted in red tells us that the summarize command result r(total) yields a correct result when displayed with a suitable format.

            Now, both generate ... sum() and egen ... total() do their calculations in double and yield the correct result when a double variable is created, but not when a float variable is created.

            So the important lesson is that whichever of the three techniques you choose, if you are going to store the results in a variable, that variable - for this data - must be a double to hold the result in full precision.
            Code:
            summarize total_sale , meanonly
            generate double sumsale = r(total)
            
            generate double sumsale = sum(total_sale)
            replace sumsale = sumsale[_N]
            
            egen double sumsale = total(total_sale)

            Comment

            Working...
            X